Predicting Aviation Accident Severity: A Data-Driven Approach to Enhancing Air Travel Safety

Done by:¶

- Pramod Kumar (UID:121032911)¶

- Swapnita Sahu (UID:121292223)¶

Table of Contents¶

  1. Introduction
  2. Importing Necessary Libraries
  3. Data Collection
  4. Downloading a dataset from Kaggle
  5. Data Import
  6. Data Cleaning
  7. Exploratory Data Analysis
    • Outlier Analysis
    • Distributions of Variables
    • Handling Skewness
    • Feature Correlations
  8. Bivariate Analysis
  9. Feature Scaling
  10. Model Building
    • Feature Selection
    • Train-Test Split
  11. Model Training
    • Baseline Models
    • Neural Network
  12. Conclusion

Introduction¶

The aviation industry is a global powerhouse, with millions of flights operated each year. Though air travel is one of the safest modes of transportation, the stakes remain high when accidents do occur. A single crash can ripple through economies, impact regulatory policies, and shift public perception of air travel. Understanding and predicting the severity of these crashes is not only crucial for enhancing safety protocols but also for minimizing the impact on passengers, crews, and the wider community. Moreover, as the aviation industry continues to evolve, with the introduction of new aircraft technologies and operational protocols, a data-driven approach to understanding potential crash outcomes becomes even more vital.

Recent events have underscored the importance of this issue. For instance, the tragic crash of a passenger plane in early 2024 raised questions about existing safety measures and the effectiveness of current predictive models. Investigations revealed that while initial crash predictions indicated low risk, unforeseen factors led to a disastrous outcome. Such incidents highlight the necessity for more robust predictive frameworks that can analyze various parameters—including weather conditions, human factors, and aircraft maintenance history—to provide more accurate severity assessments.

This project sits at the intersection of machine learning and real-world applications, showcasing the transformative power of data science in critical fields. By leveraging advanced algorithms and big data analytics, we can derive insights from vast amounts of historical flight data, accident reports, and environmental conditions. As we embark on this project, we aim not only to contribute to the body of knowledge in aviation safety but also to illustrate how data science can drive meaningful change in sectors that affect our daily lives. By harnessing the power of machine learning, we strive to create a safer future for air travel, ensuring that lessons learned from past incidents lead to actionable insights that protect lives.

Importing Necessary Libraries¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
import os
from scipy import stats

from google.colab import files

import warnings
warnings.filterwarnings(action= 'ignore')

To get started with this tutorial, the first step is to import the essential Python libraries, as demonstrated above. These libraries will be instrumental throughout the process. Using Jupyter Notebook is highly recommended for this tutorial.

A key library we’ll work with is Pandas, a powerful open-source tool for data analysis built on Python. It offers a user-friendly and flexible approach to data manipulation, allowing us to perform a variety of transformations effortlessly.

Another critical library we'll utilize is NumPy, which is designed for high-performance computations on large datasets. It provides a robust framework for storing, processing, and performing complex operations on data, streamlining the analysis process.
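As a quick, self-contained sketch of how the two libraries work together (the column names below are invented for the example):

```python
import numpy as np
import pandas as pd

# A NumPy array becomes a labeled pandas DataFrame
arr = np.array([[49.2, 14.0], [62.5, 10.0], [63.1, 13.0]])
demo = pd.DataFrame(arr, columns=['Safety_Score', 'Days_Since_Inspection'])

# NumPy's vectorized arithmetic applies element-wise to pandas columns
demo['Score_Centered'] = demo['Safety_Score'] - demo['Safety_Score'].mean()
print(demo.shape)  # (3, 3)
```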

Setting a common style to visualize plots¶

In [ ]:
sns.set_theme(style="whitegrid", palette="deep")

Data Collection¶

The Airplane Accidents Severity Dataset on Kaggle provides detailed information on airplane accidents that occurred between 2010 and 2018. The archive contains "train.csv", "test.csv", and a "sample_submission.csv". The training dataset contains 10,000 rows and 12 columns, while the testing dataset includes 2,500 rows and 11 columns (it lacks the Severity target). Each row corresponds to a unique airplane accident. The dataset features the following columns:

  • Accident_ID: A unique identifier assigned to each accident.
  • Accident_Type_Code: A numerical code indicating the type of accident (e.g., "1" for "Controlled Flight Into Terrain," "2" for "Loss of Control In Flight," etc.).
  • Cabin_Temperature: The cabin temperature at the time of the accident, measured in degrees Celsius.
  • Turbulence_In_gforces: The g-force experienced by the aircraft during the incident.
  • Control_Metric: A measure of the pilot's ability to maintain control during the accident.
  • Total_Safety_Complaints: The total number of safety complaints filed against the airline in the 12 months leading up to the accident.
  • Days_Since_Inspection: The number of days since the aircraft's last inspection.
  • Safety_Score: A metric that evaluates the overall safety performance of the airline.
  • Severity: The severity level of the accident, categorized as "Minor_Damage_And_Injuries," "Significant_Damage_And_Fatalities," "Significant_Damage_And_Serious_Injuries," or "Highly_Fatal_And_Damaging."
  • Max_Elevation: The highest altitude achieved by the aircraft during the flight.
  • Violations: The number of safety violations recorded for the airline in the 12 months preceding the accident.
  • Adverse_Weather_Metric: A metric assessing weather conditions during the time of the accident.

Downloading a dataset from Kaggle¶

  • Create an API token in your Kaggle account.
  • Download the token as kaggle.json.
  • Upload it to Google Colab.
  • Create a .kaggle directory under root and move the .json file there.
  • Set restrictive permissions on the file.
  • Download the dataset.
  • Unzip it and start using it.
In [ ]:
from google.colab import files
files.upload()  # Choose the kaggle.json file from your local machine
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving kaggle.json to kaggle.json
Out[ ]:
{'kaggle.json': b'{"username":"<redacted>","key":"<redacted>"}'}
In [ ]:
import os
# Make directory to store Kaggle API token
os.makedirs('/root/.kaggle', exist_ok=True)

# Move kaggle.json file to the .kaggle directory
!mv kaggle.json /root/.kaggle/
# Set permissions for the file to secure it
!chmod 600 /root/.kaggle/kaggle.json
In [ ]:
!kaggle datasets download -d "kaushal2896/airplane-accidents-severity-dataset"
Dataset URL: https://www.kaggle.com/datasets/kaushal2896/airplane-accidents-severity-dataset
License(s): unknown
Downloading airplane-accidents-severity-dataset.zip to /content
  0% 0.00/547k [00:00<?, ?B/s]
100% 547k/547k [00:00<00:00, 124MB/s]
In [ ]:
# Unzip the downloaded dataset
!unzip airplane-accidents-severity-dataset.zip -d /content/airplane-accidents-severity-dataset
Archive:  airplane-accidents-severity-dataset.zip
  inflating: /content/airplane-accidents-severity-dataset/sample_submission.csv  
  inflating: /content/airplane-accidents-severity-dataset/test.csv  
  inflating: /content/airplane-accidents-severity-dataset/train.csv  

Data Import¶

In [ ]:
df_train = pd.read_csv('/content/airplane-accidents-severity-dataset/train.csv')
df_train.head()
Out[ ]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 Minor_Damage_And_Injuries 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352 7570.0
1 Minor_Damage_And_Injuries 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350 12128.0
2 Significant_Damage_And_Fatalities 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364 2181.0
3 Significant_Damage_And_Serious_Injuries 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728 5946.0
4 Significant_Damage_And_Fatalities 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883 9054.0
In [ ]:
df_test = pd.read_csv('/content/airplane-accidents-severity-dataset/test.csv')
df_test.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 19.497717 16 6 72.151322 0.388959 78.32 4 37949.724386 2 0.069692 1
1 58.173516 15 3 64.585232 0.250841 78.60 7 30194.805567 2 0.002777 10
2 33.287671 15 3 64.721969 0.336669 86.96 6 17572.925484 1 0.004316 14
3 3.287671 21 5 66.362808 0.421775 80.86 3 40209.186341 2 0.199990 17
4 10.867580 18 2 56.107566 0.313228 79.22 2 35495.525408 2 0.483696 21
In [ ]:
df_train.shape
Out[ ]:
(10000, 12)
In [ ]:
df_train.describe()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
count 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 10000.00000 10000.000000 9990.000000
mean 42.009397 12.931100 6.564300 65.016036 0.381495 79.810178 3.814900 32001.803282 2.01220 0.255635 6266.772773
std 17.136684 3.539803 6.971982 12.498113 0.121301 4.513441 1.902577 9431.995196 1.03998 0.381128 3610.867005
min -78.000000 1.000000 0.000000 -97.000000 0.134000 0.000000 1.000000 831.695553 0.00000 0.000316 2.000000
25% 30.570776 11.000000 2.000000 56.927985 0.293665 77.950000 2.000000 25757.636910 1.00000 0.012063 3138.250000
50% 41.278539 13.000000 4.000000 65.587967 0.365879 79.530000 4.000000 32060.336420 2.00000 0.074467 6280.500000
75% 52.511416 15.000000 9.000000 73.336372 0.451346 81.560000 5.000000 38380.641515 3.00000 0.354059 9393.750000
max 199.000000 23.000000 54.000000 100.000000 0.882648 97.510000 7.000000 64297.651220 5.00000 2.365378 12500.000000
In [ ]:
df_test.shape
Out[ ]:
(2500, 11)
In [ ]:
df_test.describe()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
count 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000 2500.000000
mean 41.825224 12.946400 6.574800 65.368058 0.376197 79.993068 3.853600 32383.134179 1.990800 0.250886 6186.283200
std 16.280187 3.523364 7.179542 11.442005 0.116960 2.713833 1.877652 9485.096436 1.018592 0.387663 3602.235035
min 0.000000 1.000000 0.000000 20.966272 0.143376 74.740000 1.000000 831.695553 0.000000 0.000368 1.000000
25% 30.593607 11.000000 1.000000 57.702826 0.292583 77.930000 2.000000 26008.851717 1.000000 0.013136 3071.750000
50% 41.461187 13.000000 4.000000 66.066545 0.357404 79.600000 4.000000 32472.865497 2.000000 0.072466 6159.500000
75% 52.751142 15.000000 9.000000 73.119872 0.441699 81.530000 5.000000 38759.519071 3.000000 0.315407 9309.250000
max 100.000000 23.000000 54.000000 97.994531 0.881926 94.200000 7.000000 62315.408444 5.000000 2.365378 12493.000000

Data Cleaning¶

  • The code given below checks for missing values in the training and testing datasets by summing the null entries for each column.
  • It helps identify which columns have missing values and their extent.
In [ ]:
df_train.isnull().sum()
Out[ ]:
0
Severity 0
Safety_Score 0
Days_Since_Inspection 0
Total_Safety_Complaints 0
Control_Metric 0
Turbulence_In_gforces 0
Cabin_Temperature 0
Accident_Type_Code 0
Max_Elevation 0
Violations 0
Adverse_Weather_Metric 0
Accident_ID 10

  • The Accident_ID column has 10 missing values. Considering the dataset contains 10,000 records, removing these 10 rows is a reasonable approach. This represents only 0.1% of the data, and the impact on the analysis or model performance will be negligible.

  • By dropping the rows with missing Accident_ID from the training data (and the unused Accident_Type_Code column from the test data), we ensure the dataset maintains its integrity and avoids issues caused by missing values, ultimately resulting in a more reliable dataset for further analysis or machine learning tasks.

In [ ]:
# Removing rows where Accident_ID is null
df_train = df_train.dropna(subset=['Accident_ID'])

# Verifying the changes
print(df_train.info())
<class 'pandas.core.frame.DataFrame'>
Index: 9990 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Severity                 9990 non-null   object 
 1   Safety_Score             9990 non-null   float64
 2   Days_Since_Inspection    9990 non-null   int64  
 3   Total_Safety_Complaints  9990 non-null   int64  
 4   Control_Metric           9990 non-null   float64
 5   Turbulence_In_gforces    9990 non-null   float64
 6   Cabin_Temperature        9990 non-null   float64
 7   Accident_Type_Code       9990 non-null   int64  
 8   Max_Elevation            9990 non-null   float64
 9   Violations               9990 non-null   int64  
 10  Adverse_Weather_Metric   9990 non-null   float64
 11  Accident_ID              9990 non-null   float64
dtypes: float64(7), int64(4), object(1)
memory usage: 1014.6+ KB
None
In [ ]:
testing2= df_test.drop(['Accident_Type_Code'], axis=1)
testing2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Safety_Score             2500 non-null   float64
 1   Days_Since_Inspection    2500 non-null   int64  
 2   Total_Safety_Complaints  2500 non-null   int64  
 3   Control_Metric           2500 non-null   float64
 4   Turbulence_In_gforces    2500 non-null   float64
 5   Cabin_Temperature        2500 non-null   float64
 6   Max_Elevation            2500 non-null   float64
 7   Violations               2500 non-null   int64  
 8   Adverse_Weather_Metric   2500 non-null   float64
 9   Accident_ID              2500 non-null   int64  
dtypes: float64(6), int64(4)
memory usage: 195.4 KB
In [ ]:
#Print their respective shapes
print("Shape of training data is:", df_train.shape)
print("Shape of testing data is:", testing2.shape)
Shape of training data is: (9990, 12)
Shape of testing data is: (2500, 10)
In [ ]:
df_train['Severity'].value_counts()
Out[ ]:
count
Severity
Highly_Fatal_And_Damaging 3038
Significant_Damage_And_Serious_Injuries 2716
Minor_Damage_And_Injuries 2505
Significant_Damage_And_Fatalities 1686
Minor_Damage_And_Injry 8
Minor_Damage_And_Injuries 7
Highly_Fatal_And_Damagin 4
Significant_Damage_And_Serious_Injry 4
Sigificant_Damage_And_Fatalities 4
Highly_Fatal_And_Dmg 4
Sigificant_Damage_And_Serious_Injuries 3
Minor_Damge_And_Injuries 3
Highly_Fatl_And_Damaging 3
Significant_Damage_And_Fatalty 3
Significant_Damge_And_Serious_Injuries 2

We count the occurrences of each unique value in the Severity column to understand its distribution.

In [ ]:
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /usr/local/lib/python3.10/dist-packages (0.18.0)
In [ ]:
from fuzzywuzzy import process

# Define the list of correct categories
valid_categories = [
    'Highly_Fatal_And_Damaging',
    'Significant_Damage_And_Serious_Injuries',
    'Minor_Damage_And_Injuries',
    'Significant_Damage_And_Fatalities'
]

# Function to match each value to the closest valid category
def match_severity(value):
    return process.extractOne(value, valid_categories)[0]

# Apply the function to the 'Severity' column
df_train['Severity'] = df_train['Severity'].apply(match_severity)

# Verify the changes
print(df_train['Severity'].value_counts())
Severity
Highly_Fatal_And_Damaging                  3049
Significant_Damage_And_Serious_Injuries    2725
Minor_Damage_And_Injuries                  2523
Significant_Damage_And_Fatalities          1693
Name: count, dtype: int64
  • The cells above install the fuzzywuzzy library, which helps with string matching and correction.
  • The process module is used to find the closest matching string from a list of valid categories.
  • A list of the four valid severity categories is defined.
  • Fuzzy matching is applied to the Severity column to correct inconsistent or misspelled entries.
  • This ensures that the Severity column values align with the predefined valid categories.
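If installing an extra package is not an option, the standard library's difflib provides a similar closest-match lookup; a minimal sketch of that alternative (the 0.6 cutoff is an assumption, not a tuned value):

```python
import difflib

valid_categories = [
    'Highly_Fatal_And_Damaging',
    'Significant_Damage_And_Serious_Injuries',
    'Minor_Damage_And_Injuries',
    'Significant_Damage_And_Fatalities',
]

def match_severity_stdlib(value):
    # cutoff=0.6 rejects strings that are not close to any valid category
    matches = difflib.get_close_matches(value, valid_categories, n=1, cutoff=0.6)
    return matches[0] if matches else value

print(match_severity_stdlib('Minor_Damage_And_Injry'))  # Minor_Damage_And_Injuries
```

Unlike process.extractOne, get_close_matches can return nothing, so the fallback here keeps the original value for manual review.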
In [ ]:
df_train['Cabin_Temperature'].value_counts()
Out[ ]:
count
Cabin_Temperature
78.46 48
80.98 43
78.37 42
79.17 41
81.26 40
... ...
86.48 1
75.15 1
85.25 1
80.10 1
85.31 1

951 rows × 1 columns


We count the frequency of each unique value in the Cabin_Temperature column.

In [ ]:
# Filter rows where Cabin_Temperature is 0
cabin_temp_zero = df_train[df_train['Cabin_Temperature'] == 0]

# Display the rows with Cabin_Temperature = 0
cabin_temp_zero
Out[ ]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric Accident_ID
162 Significant_Damage_And_Serious_Injuries 57.168950 8 5 88.878760 0.213697 0.0 6 14287.47546 2 0.003415 253.0
624 Minor_Damage_And_Injuries 41.735160 16 9 76.253418 0.387877 0.0 7 19804.49457 1 0.001585 9059.0
1061 Significant_Damage_And_Fatalities 20.228310 15 1 55.970830 0.301688 0.0 4 49056.75108 2 0.091419 2575.0
2135 Minor_Damage_And_Injuries 29.497717 20 10 57.611668 0.366600 0.0 4 38630.47568 1 0.070887 3458.0
2442 Highly_Fatal_And_Damaging 33.789954 11 5 72.880583 0.406990 0.0 4 25564.94431 2 0.047157 11395.0
2827 Minor_Damage_And_Injuries 48.264840 14 1 77.347311 0.256611 0.0 2 42688.63692 2 0.577209 4954.0
2917 Highly_Fatal_And_Damaging 38.858447 17 7 55.059253 0.338472 0.0 4 39523.66580 0 0.072389 3533.0
3984 Highly_Fatal_And_Damaging 27.077626 14 1 69.371012 0.311425 0.0 2 35553.18579 3 0.482866 5121.0
4143 Significant_Damage_And_Fatalities 23.470320 14 6 53.509572 0.264545 0.0 1 18799.48957 2 0.691709 7880.0
4493 Minor_Damage_And_Injuries 43.013699 16 2 67.775752 0.313228 0.0 2 27412.26472 3 0.368581 611.0
5617 Significant_Damage_And_Serious_Injuries 44.520548 12 0 72.515953 0.416726 0.0 3 31526.63293 4 0.155474 4797.0
5789 Minor_Damage_And_Injuries 59.406393 11 32 76.937101 0.381746 0.0 2 32584.01461 2 0.440815 7251.0
6183 Highly_Fatal_And_Damaging 1.415525 22 2 63.901550 0.353618 0.0 3 25720.91641 2 0.128896 8846.0
6202 Significant_Damage_And_Serious_Injuries 55.616438 8 12 77.666363 0.276084 0.0 5 53200.86315 2 0.035165 10523.0
7328 Highly_Fatal_And_Damaging 13.789954 17 1 63.400182 0.585497 0.0 4 40073.72017 1 0.074291 7808.0
7568 Significant_Damage_And_Serious_Injuries 57.945205 8 16 51.002735 0.439445 0.0 3 33284.93084 2 0.166010 5043.0
8065 Significant_Damage_And_Serious_Injuries 24.885845 18 9 82.907931 0.337750 0.0 6 32328.21992 2 0.007654 1680.0
8134 Highly_Fatal_And_Damaging 25.159817 14 11 65.587967 0.236777 0.0 1 22730.47091 0 0.832144 2120.0
8149 Significant_Damage_And_Serious_Injuries 48.721461 10 1 61.212397 0.368043 0.0 3 22475.12587 2 0.112245 192.0
8674 Significant_Damage_And_Serious_Injuries 37.305936 14 0 57.520510 0.244350 0.0 3 45969.45269 0 0.228432 5235.0

The above code filters and displays rows where Cabin_Temperature is 0, which might indicate incorrect or missing data.

In [ ]:
# Calculate the median of Cabin_Temperature excluding zeros
median_temp = df_train.loc[df_train['Cabin_Temperature'] != 0, 'Cabin_Temperature'].median()

# Replace all Cabin_Temperature = 0 with the median
df_train.loc[df_train['Cabin_Temperature'] == 0, 'Cabin_Temperature'] = median_temp

# Verify the changes
print(f"Median used for replacement: {median_temp}")
df_train['Cabin_Temperature'].value_counts()
Median used for replacement: 79.54
Out[ ]:
count
Cabin_Temperature
78.46 48
80.98 43
78.37 42
79.17 41
81.26 40
... ...
84.95 1
89.29 1
85.25 1
80.10 1
85.31 1

950 rows × 1 columns


  • We calculate the median of the Cabin_Temperature column, excluding rows where the value is 0.
  • Median is chosen as it is less sensitive to outliers compared to the mean, ensuring a robust replacement value.
  • All occurrences of 0 are replaced in the Cabin_Temperature column with the calculated median. This ensures the dataset does not contain invalid values while preserving the column's overall distribution.
  • We confirm the value used for replacement and verify the updated frequency distribution of Cabin_Temperature.
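A toy illustration of that robustness claim: a single invalid zero shifts the mean far more than the median (the numbers below are made up):

```python
import numpy as np

# Four plausible cabin temperatures plus one bad zero reading
temps = np.array([78.0, 79.5, 80.1, 81.3, 0.0])

print(np.mean(temps))    # 63.78 — dragged down by the invalid zero
print(np.median(temps))  # 79.5  — essentially unaffected
```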
In [ ]:
# Calculate the median of Control_Metric excluding zeros
median_temp = df_train.loc[df_train['Control_Metric'] != 0, 'Control_Metric'].median()

# Replace all Control_Metric = 0 with the median
df_train.loc[df_train['Control_Metric'] == 0, 'Control_Metric'] = median_temp

# Verify the changes
print(f"Median used for replacement: {median_temp}")
df_train['Control_Metric'].value_counts()
Median used for replacement: 65.58796718
Out[ ]:
count
Control_Metric
72.014585 64
63.901550 56
57.520510 52
62.078396 46
65.132179 44
... ...
-68.000000 1
39.699180 1
93.801276 1
-97.000000 1
73.792160 1

960 rows × 1 columns


In [ ]:
# Calculate the median of Safety_Score excluding zeros
median_temp = df_train.loc[df_train['Safety_Score'] != 0, 'Safety_Score'].median()

# Replace all Safety_Score = 0 with the median
df_train.loc[df_train['Safety_Score'] == 0, 'Safety_Score'] = median_temp

# Verify the changes
print(f"Median used for replacement: {median_temp}")
df_train['Safety_Score'].value_counts()
Median used for replacement: 41.32420091
Out[ ]:
count
Safety_Score
38.447489 42
40.776256 38
28.904110 35
42.100457 34
39.817352 33
... ...
-26.000000 1
23.333333 1
153.000000 1
-12.000000 1
7.945205 1

1202 rows × 1 columns


In [ ]:
df_train= df_train.drop(['Accident_ID'], axis=1)
df_train.head()
Out[ ]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric
0 Minor_Damage_And_Injuries 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352
1 Minor_Damage_And_Injuries 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350
2 Significant_Damage_And_Fatalities 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364
3 Significant_Damage_And_Serious_Injuries 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728
4 Significant_Damage_And_Fatalities 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883

We removed the Accident_ID column from the training dataset.

  • Accident_ID: A unique identifier that does not contribute directly to the analysis or model building.

Exploratory Data Analysis¶

The dataset is now clean, consistent, and ready for reliable analysis or modeling.

Outlier Analysis¶

In [ ]:
# Prepare numerical data for boxplots
num_df = df_train[['Safety_Score', 'Control_Metric', 'Turbulence_In_gforces',
                   'Cabin_Temperature', 'Max_Elevation', 'Adverse_Weather_Metric', 'Total_Safety_Complaints']]

# Set the number of rows and columns for the grid
num_cols = 2  # 2 boxplots per row
num_plots = len(num_df.columns)
rows = (num_plots + num_cols - 1) // num_cols  # Calculate required rows

# Create the figure and axes
fig, axes = plt.subplots(rows, num_cols, figsize=(14, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten axes array for easier indexing

# Plot each variable as a boxplot
for i, col in enumerate(num_df.columns):
    sns.boxplot(data=num_df[col], color='skyblue', width=0.6, ax=axes[i])
    axes[i].set_title(f'Boxplot of {col}', fontsize=14, color='black', pad=10)  # Title
    axes[i].set_xlabel(col, fontsize=12, color='black', labelpad=10)  # X-axis label
    axes[i].set_ylabel('Value', fontsize=12, color='black', labelpad=10)  # Y-axis label
    axes[i].grid(visible=True, color='gray', linestyle='--', linewidth=0.5, alpha=0.6)  # Grid styling

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to fit everything nicely
plt.tight_layout()

# Show the plot
plt.show()
[Figure: boxplots of the numerical variables]

The primary focus here is to visualize the distribution and detect potential outliers for key numerical variables using boxplots. Boxplots are particularly useful for summarizing the range, interquartile range, median, and identifying outliers in data.

This approach provides a comprehensive visual analysis of numerical data, enabling:

  • Identification of outliers that may skew the analysis or modeling.
  • Comparison of distributions across different features.
  • Quick insights into the data's structure and variability.

As the boxplots above show, the data contains many outliers, especially in variables like 'Total_Safety_Complaints', 'Adverse_Weather_Metric', and 'Turbulence_In_gforces'. Removing them outright would discard too much data, so let's see whether transforming these variables improves the situation.
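The boxplot impression can also be quantified with the same 1.5×IQR rule the whiskers use; a sketch on a made-up series (in the notebook, `num_df.apply(iqr_outlier_count)` would give per-column counts):

```python
import pandas as pd

def iqr_outlier_count(s: pd.Series) -> int:
    """Count values outside the 1.5*IQR whiskers drawn by a boxplot."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# Mostly small complaint counts with a couple of extreme values
s = pd.Series([0, 1, 2, 2, 3, 4, 5, 9, 40, 54])
print(iqr_outlier_count(s))  # 2
```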

In [ ]:
#Let's map the Dependent variable to their respective categorial dummies
df_train['Severity']= df_train.Severity.map({'Minor_Damage_And_Injuries': '1', 'Significant_Damage_And_Fatalities': '2', 'Significant_Damage_And_Serious_Injuries': '3', 'Highly_Fatal_And_Damaging': '4'})
df_train.head()
Out[ ]:
Severity Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Accident_Type_Code Max_Elevation Violations Adverse_Weather_Metric
0 1 49.223744 14 22 71.285324 0.272118 78.04 2 31335.47682 3 0.424352
1 1 62.465753 10 27 72.288058 0.423939 84.54 2 26024.71106 2 0.352350
2 2 63.059361 13 16 66.362808 0.322604 78.86 7 39269.05393 3 0.003364
3 3 48.082192 11 9 74.703737 0.337029 81.79 3 42771.49920 1 0.211728
4 2 26.484018 13 25 47.948952 0.541140 77.16 3 35509.22852 2 0.176883

The dependent variable Severity is mapped to numeric codes to streamline analysis and ensure consistent representation. The mapping is as follows:

  • 'Minor_Damage_And_Injuries' → '1'
  • 'Significant_Damage_And_Fatalities' → '2'
  • 'Significant_Damage_And_Serious_Injuries' → '3'
  • 'Highly_Fatal_And_Damaging' → '4'

This transformation converts descriptive labels into numeric representations, simplifying visualization and modeling tasks.

In [ ]:
# Define the figure and axes with a specific size
fig, ax = plt.subplots(figsize=(12, 6))

# Create the count plot
sns.countplot(
    data=df_train,
    x='Severity',
    palette='coolwarm',
    order=df_train['Severity'].value_counts().index,  # Sort by frequency
    saturation=0.8,
    ax=ax  # Use the defined axis
)

# Add a title and labels
ax.set_title('Distribution of Severity', fontsize=16, fontweight='bold', pad=15, color='black')
ax.set_xlabel('Severity', fontsize=12, labelpad=10, color='black')
ax.set_ylabel('Count', fontsize=12, labelpad=10, color='black')

# Rotate x-axis labels for better readability
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, fontsize=10)
ax.tick_params(axis='y', labelsize=10)

# Adjust layout to remove extra space
fig.tight_layout()

# Display the plot
plt.show()
[Figure: count plot of the Severity classes]

Insights from the Plot

  • The count plot provides a clear view of how the data is distributed across the four severity categories.
  • It highlights potential class imbalances, which are crucial to address in subsequent modeling steps, particularly for classification problems.
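To put numbers on the imbalance, the cleaned class counts from the value_counts output above can be converted to proportions:

```python
import pandas as pd

# Class counts taken from the cleaned value_counts output above
counts = pd.Series({
    'Highly_Fatal_And_Damaging': 3049,
    'Significant_Damage_And_Serious_Injuries': 2725,
    'Minor_Damage_And_Injuries': 2523,
    'Significant_Damage_And_Fatalities': 1693,
})
proportions = counts / counts.sum()
print(proportions.round(3))
```

The largest class holds roughly 30% of the data and the smallest about 17% — a mild imbalance; many scikit-learn classifiers accept class_weight='balanced' should it turn out to matter during modeling.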

Distributions of Variables¶

In [ ]:
# Define the number of variables and create subplots
num_vars = num_df.columns
num_plots = len(num_vars)
rows = (num_plots + 2) // 3  # Arrange in a grid with 3 columns per row
fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Set a consistent theme
sns.set_theme(style="whitegrid")

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(num_df[var], kde=True, color="skyblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()
[Figure: histograms with KDE overlays for each numerical variable]

Handling Skewness¶

Skewed distributions can adversely affect machine learning models by violating the assumptions of normality in some algorithms. To address this, specific transformations are applied to normalize the data.

Initial Visualization Histograms with KDE:

  • Each numerical variable is plotted using sns.histplot() with KDE (Kernel Density Estimate) overlays to observe the data distribution.
  • Left Skew: Variables like Control_Metric show a distribution with a longer tail on the left.
  • Right Skew: Variables such as Cabin_Temperature, Total_Safety_Complaints, Adverse_Weather_Metric, and Turbulence_In_gforces have distributions with a longer tail on the right.
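Skewness can also be measured numerically with scipy.stats.skew (scipy is already imported in this notebook); a small synthetic demonstration of what a log transform does to a right-skewed variable:

```python
import numpy as np
from scipy import stats

# Lognormal data is strongly right-skewed; taking its log recovers a normal
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

print(round(float(stats.skew(right_skewed)), 2))          # strongly positive
print(round(float(stats.skew(np.log(right_skewed))), 2))  # near zero
```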
In [ ]:
#Fixing the right skew
num_df['Total_Safety_Complaints'] = np.log(num_df['Total_Safety_Complaints']+1)  # +1 because the column contains zeros and log(0) is undefined
num_df['Adverse_Weather_Metric'] = np.log(num_df['Adverse_Weather_Metric'])
num_df['Cabin_Temperature'] = np.log(num_df['Cabin_Temperature'])
num_df['Turbulence_In_gforces'] = np.log(num_df['Turbulence_In_gforces'])

#Fixing left skew
num_df['Control_Metric'] = np.power(num_df['Control_Metric'], 2)

Transformation of Skewed Variables

To normalize the distributions:

Right-Skewed Variables:

Log transformations are applied using np.log(), which compresses the right tail and spreads out values near zero. Variables Transformed:

  • Total_Safety_Complaints (added +1 to avoid logarithm of zero).
  • Adverse_Weather_Metric
  • Cabin_Temperature
  • Turbulence_In_gforces

Left-Skewed Variable:

A power transformation is applied to Control_Metric by squaring the values (np.power(x, 2)) to correct the skewness.
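The hand-picked transforms above work here, but scikit-learn's PowerTransformer can learn a per-column Yeo-Johnson transform automatically; a sketch on synthetic data (assuming scikit-learn is available, as the train_test_split import above implies):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# One right-skewed toy column
rng = np.random.default_rng(42)
X = rng.lognormal(size=(1000, 1))

# Yeo-Johnson handles zeros and negatives; standardize=True (the default)
# also rescales the output to zero mean and unit variance
pt = PowerTransformer(method='yeo-johnson')
X_t = pt.fit_transform(X)

print(round(float(X_t.mean()), 3), round(float(X_t.std()), 3))  # ~0.0 ~1.0
```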

In [ ]:
# Define the number of variables and create subplots
num_vars = num_df.columns
num_plots = len(num_vars)
rows = (num_plots + 2) // 3  # Arrange in a grid with 3 columns per row
fig, axes = plt.subplots(rows, 3, figsize=(15, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Set a consistent theme
sns.set_theme(style="whitegrid")

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(num_df[var], kde=True, color="skyblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()
[Figure: histograms of the variables after the skew transformations]

Post-Transformation Visualization

The histograms are re-plotted after the transformations:

  • Variables previously skewed to the right exhibit more symmetrical distributions post log transformation.
  • Control_Metric no longer shows left skewness after the power transformation.

Advantages of Transformation

  • Improved Model Performance:
    • Normalized distributions help in reducing model bias and variance.
    • Algorithms sensitive to distribution perform better with transformed data.
  • Enhanced Interpretability:
    • Correcting skewness ensures that summary statistics like mean and standard deviation better represent the data.

Feature Correlations¶

In [ ]:
# Calculate the correlation matrix
correlation = num_df.corr()

# Create a heatmap with better styling
plt.figure(figsize=(12, 8))  # Adjust the figure size
sns.set_theme(style="white")  # Use a clean white background theme

# Create the heatmap
heatmap = sns.heatmap(
    correlation,
    annot=True,  # Annotate each cell with the correlation value
    fmt=".2f",  # Format the numbers to 2 decimal places
    cmap="coolwarm",  # Use a color palette
    vmin=-1, vmax=1,  # Ensure the color range is consistent
    linewidths=0.5,  # Add thin lines between cells
    annot_kws={"size": 10, "color": "black"}  # Customize annotations
)

# Add title and labels
plt.title("Correlation Heatmap of Numerical Variables", fontsize=16, fontweight='bold', pad=15)
plt.xticks(fontsize=10, rotation=45, ha='right')  # Rotate x-axis labels for better readability
plt.yticks(fontsize=10, rotation=0)  # Keep y-axis labels horizontal

# Remove extra spaces and display
plt.tight_layout()
plt.show()

No variable pairs show worrying levels of correlation except 'Control_Metric' and 'Turbulence_In_gforces', at roughly 0.6. Since 0.6 is closer to a moderate correlation than to an extreme one, and keeping both features yielded better model results, both were retained.
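
A check like this can also be automated. The sketch below uses a hypothetical correlated_pairs helper on a toy DataFrame standing in for num_df: it scans the upper triangle of the correlation matrix and reports any pair whose absolute correlation exceeds a chosen threshold (0.6 here, matching the discussion above).

```python
# Hypothetical helper: report feature pairs whose absolute correlation
# exceeds a threshold; the toy DataFrame stands in for num_df.
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.6):
    corr = df.corr().abs()
    # keep the upper triangle only, so each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [
        (a, b, round(corr.loc[a, b], 2))
        for a in upper.index
        for b in upper.columns
        if pd.notna(upper.loc[a, b]) and upper.loc[a, b] > threshold
    ]

rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "a": x,
    "b": 0.8 * x + rng.normal(scale=0.5, size=500),  # built to correlate with a
    "c": rng.normal(size=500),                        # independent noise
})
print(correlated_pairs(toy))  # only the (a, b) pair crosses 0.6
```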

In [ ]:
#Let's put the entire dataset back together
Rem= df_train[['Days_Since_Inspection','Violations','Accident_Type_Code','Severity']]
train2= pd.concat([num_df, Rem], axis=1)
train2.head()
Out[ ]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code Severity
0 49.223744 5081.597362 -1.301521 4.357222 31335.47682 -0.857192 3.135494 14 3 2 1
1 62.465753 5225.563379 -0.858166 4.437225 26024.71106 -1.043130 3.332205 10 2 2 1
2 63.059361 4404.022241 -1.131328 4.367674 39269.05393 -5.694652 2.833213 13 3 7 2
3 48.082192 5580.648392 -1.087586 4.404155 42771.49920 -1.552452 2.302585 11 1 3 3
4 26.484018 2299.101968 -0.614077 4.345881 35509.22852 -1.732265 3.258097 13 2 3 2

Bivariate analysis¶

  1. Compares two or more attributes in a single graph.
  2. Visualizes their joint behavior.
  3. Uncovers hidden patterns and relationships between the attributes.

Here we classify each attribute as categorical or quantitative:

Categorical:

  1. Violations.

  2. Accident_Type_Code.

  3. Days_Since_Inspection.

  4. Target Variable: Severity

Quantitative:

  1. Adverse_Weather_Metric

  2. Max_Elevation

  3. Cabin_Temperature

  4. Turbulence_In_gforces

  5. Control_Metric

  6. Total_Safety_Complaints

  7. Safety_Score
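
The split above can also be derived programmatically. The sketch below uses an illustrative heuristic (not the authors' method): low-cardinality integer columns are treated as categorical, everything else as quantitative; the toy frame stands in for train2.

```python
# Illustrative heuristic (not the authors' method): low-cardinality integer
# columns are treated as categorical, everything else as quantitative.
import pandas as pd

def split_columns(df, max_categories=30):
    categorical, quantitative = [], []
    for col in df.columns:
        low_cardinality_int = (
            pd.api.types.is_integer_dtype(df[col])
            and df[col].nunique() <= max_categories
        )
        (categorical if low_cardinality_int else quantitative).append(col)
    return categorical, quantitative

# Toy frame standing in for train2
toy = pd.DataFrame({
    "Violations": [0, 1, 2, 1],                 # small integer codes
    "Safety_Score": [49.2, 62.5, 63.1, 48.1],   # continuous measurement
})
print(split_columns(toy))  # (['Violations'], ['Safety_Score'])
```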

Comparing Categorical and Quantitative attributes together¶

1. Adverse_Weather_Metric¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for better aesthetics
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Adverse Weather Metric
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Adverse Weather Metric
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Adverse Weather Metric
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for improved alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for better aesthetics
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Adverse Weather Metric
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Adverse Weather Metric
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Adverse Weather Metric
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Adverse_Weather_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Adverse Weather Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Adverse Weather Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for improved alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

2. Max_Elevation¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14,12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Max Elevation
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Max Elevation', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Max Elevation
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Max Elevation', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Max Elevation
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Max Elevation', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjust layout for better alignment
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette for professional visuals
custom_palette = sns.color_palette("husl")

# Plot 1: Violations vs Max Elevation
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Max Elevation', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Max Elevation
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Max Elevation', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Max Elevation
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Max_Elevation',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Max Elevation', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Max Elevation (ft)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjust layout for better spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

3. Cabin_Temperature¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Cabin Temperature
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Cabin Temperature', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Cabin Temperature
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Cabin Temperature', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Cabin Temperature
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Cabin Temperature', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Cabin Temperature
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Cabin Temperature', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Cabin Temperature
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Cabin Temperature', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Cabin Temperature
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Cabin_Temperature',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Cabin Temperature', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Cabin Temperature (°C)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

4. Turbulence_In_gforces¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Set2")

# Plot 1: Violations vs Turbulence In g-forces
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Turbulence In g-forces
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Turbulence In g-forces
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Turbulence In g-forces
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Turbulence In g-forces
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Turbulence In g-forces
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Turbulence_In_gforces',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Turbulence In g-forces', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Turbulence (g-forces)', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

5. Control_Metric¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("Spectral")

# Plot 1: Violations vs Control Metric
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Control Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Control Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Control Metric
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Control Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Control Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Control Metric
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Control Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Control Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Control Metric
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Control Metric', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Control Metric', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Control Metric
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Control Metric', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Control Metric', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Control Metric
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Control_Metric',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Control Metric', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Control Metric', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

6. Total_Safety_Complaints¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Total Safety Complaints
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Total Safety Complaints', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Total Safety Complaints', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Total Safety Complaints
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Total Safety Complaints', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Total Safety Complaints', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Total Safety Complaints
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Total Safety Complaints', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Total Safety Complaints', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Total Safety Complaints
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Total Safety Complaints', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Total Safety Complaints', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Total Safety Complaints
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Total Safety Complaints', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Total Safety Complaints', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Total Safety Complaints
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Total_Safety_Complaints',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Total Safety Complaints', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Total Safety Complaints', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()

7. Safety_Score¶

In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Safety Score
sns.boxplot(
    ax=axes[0],
    x='Violations',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Safety Score', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Safety Score', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Safety Score
sns.boxplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Safety Score', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Safety Score', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Safety Score
sns.boxplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Safety Score', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Safety Score', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
fig, axes = plt.subplots(3, 1, figsize=(14, 12))

# Custom color palette
custom_palette = sns.color_palette("coolwarm")

# Plot 1: Violations vs Safety Score
sns.lineplot(
    ax=axes[0],
    x='Violations',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[0].set_title('Violations vs Safety Score', fontsize=14, weight='bold')
axes[0].set_xlabel('Violations', fontsize=12)
axes[0].set_ylabel('Safety Score', fontsize=12)
axes[0].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 2: Accident Type Code vs Safety Score
sns.lineplot(
    ax=axes[1],
    x='Accident_Type_Code',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[1].set_title('Accident Type Code vs Safety Score', fontsize=14, weight='bold')
axes[1].set_xlabel('Accident Type Code', fontsize=12)
axes[1].set_ylabel('Safety Score', fontsize=12)
axes[1].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Plot 3: Days Since Inspection vs Safety Score
sns.lineplot(
    ax=axes[2],
    x='Days_Since_Inspection',
    y='Safety_Score',
    data=train2,
    hue='Severity',
    palette=custom_palette
)
axes[2].set_title('Days Since Inspection vs Safety Score', fontsize=14, weight='bold')
axes[2].set_xlabel('Days Since Inspection', fontsize=12)
axes[2].set_ylabel('Safety Score', fontsize=12)
axes[2].legend(title="Severity", bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10, title_fontsize=12)

# Adjusting layout for better alignment and spacing
fig.tight_layout()
plt.subplots_adjust(right=0.85)  # To make space for legends

plt.show()
In [ ]:
# Drop the target column before scaling (Severity should not be standardized)
train2 = train2.drop(columns='Severity')

Feature Scaling¶

In [ ]:
from sklearn import preprocessing

# Standardize each feature to mean 0 and standard deviation 1
scaler = preprocessing.StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(train2), columns=train2.columns)
scaled_df.head()
Out[ ]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code
0 0.420351 0.454187 -0.921487 -0.700251 -0.070907 0.957591 1.611550 0.301913 0.949826 -0.953875
1 1.194083 0.547996 0.492818 1.650795 -0.634031 0.861217 1.820414 -0.828034 -0.011743 -0.953875
2 1.228768 0.012680 -0.378571 -0.393082 0.770327 -1.549721 1.290592 0.019426 0.949826 1.673713
3 0.353650 0.779369 -0.239031 0.678977 1.141707 0.597230 0.727179 -0.545548 -0.973311 -0.428357
4 -0.908335 -1.358884 1.271467 -1.033508 0.371655 0.504031 1.741727 0.019426 -0.011743 -0.428357
  • The StandardScaler from sklearn.preprocessing is used for standardization.
  • Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, ensuring a common scale without distorting relative relationships.
  • scaled_df.head() outputs the first 5 rows of the standardized data.
  • This allows verification that scaling was applied correctly and data integrity was maintained.
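
To make the standardization concrete, here is a small sketch (toy data, not train2) confirming that StandardScaler's output equals the hand-rolled (x − mean) / std, using the population standard deviation (ddof=0) that sklearn uses internally.

```python
# Toy check that StandardScaler equals (x - mean) / population std.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"f1": [1.0, 2.0, 3.0, 4.0], "f2": [10.0, 20.0, 30.0, 40.0]})

scaled = StandardScaler().fit_transform(toy)
manual = (toy - toy.mean()) / toy.std(ddof=0)  # ddof=0: divide by n, as sklearn does

print(np.allclose(scaled, manual))  # True
```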
In [ ]:
# Check the mean (should be approximately 0) and standard deviation (approximately 1) of the scaled DataFrame
scaled_df.mean()
Out[ ]:
0
Safety_Score -2.212000e-16
Control_Metric -3.321556e-16
Turbulence_In_gforces -6.650225e-17
Cabin_Temperature -9.957556e-16
Max_Elevation 2.933923e-16
Adverse_Weather_Metric 1.422508e-17
Total_Safety_Complaints 1.753241e-16
Days_Since_Inspection -1.792360e-16
Violations 8.463922e-17
Accident_Type_Code -7.823794e-17

In [ ]:
scaled_df.std()
Out[ ]:
0
Safety_Score 1.00005
Control_Metric 1.00005
Turbulence_In_gforces 1.00005
Cabin_Temperature 1.00005
Max_Elevation 1.00005
Adverse_Weather_Metric 1.00005
Total_Safety_Complaints 1.00005
Days_Since_Inspection 1.00005
Violations 1.00005
Accident_Type_Code 1.00005
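
The value 1.00005 rather than exactly 1 is not an error: StandardScaler divides by the population standard deviation (ddof=0), while pandas' .std() defaults to the sample estimator (ddof=1), so the reported value equals sqrt(n / (n − 1)). The sketch below reproduces the figure; the row count n is an assumption chosen to match.

```python
# Reproducing the 1.00005: sklearn standardizes with the population std
# (ddof=0); pandas .std() then reports the sample std (ddof=1), which is
# larger by a factor of sqrt(n / (n - 1)).
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 10_000  # assumed row count; a training set of about this size yields 1.00005
col = pd.DataFrame({"x": np.random.default_rng(1).normal(size=n)})
scaled = pd.DataFrame(StandardScaler().fit_transform(col), columns=["x"])

print(round(scaled["x"].std(), 5))     # 1.00005
print(round(np.sqrt(n / (n - 1)), 5))  # 1.00005
```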

In [ ]:
# Compare the distributions of the variables before and after scaling
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 15))

cols = ['Safety_Score', 'Days_Since_Inspection', 'Total_Safety_Complaints',
        'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature',
        'Max_Elevation', 'Violations', 'Adverse_Weather_Metric']

ax1.set_title('Before Scaling')
for col in cols:
    sns.kdeplot(df_train[col], ax=ax1)

ax2.set_title('After Standard Scaler')
for col in cols:
    sns.kdeplot(scaled_df[col], ax=ax2)

plt.show()
In [ ]:
# Define the number of variables and create subplots
num_vars = testing2.columns
num_plots = len(num_vars)
rows = (num_plots + 1) // 2  # Arrange in a grid with 2 plots per row
fig, axes = plt.subplots(rows, 2, figsize=(14, rows * 4))  # Adjust size dynamically
axes = axes.flatten()  # Flatten the 2D array of axes for easier indexing

# Plot each variable in its respective subplot
for i, var in enumerate(num_vars):
    sns.histplot(testing2[var], kde=True, color="dodgerblue", ax=axes[i])  # Use histplot with KDE
    axes[i].set_title(f"Distribution of {var}", fontsize=14, pad=10)
    axes[i].set_xlabel(var, fontsize=12)
    axes[i].set_ylabel("Frequency", fontsize=12)

# Hide any unused subplots (if the number of variables is odd)
for j in range(num_plots, len(axes)):
    axes[j].set_visible(False)

# Adjust layout to remove unwanted spaces
plt.tight_layout()

# Show the plot
plt.show()
[Figure: histograms with KDE overlays for each test-set variable, two plots per row]

The features now have a mean of 0 and a standard deviation of 1 (note that scaling changes location and spread, not the shape of a distribution; the earlier log transforms handled the skewness). Let's apply the same transformations to the test data before proceeding with model fitting.

In [ ]:
#Applying transformations
testing2['Total_Safety_Complaints'] = np.log(testing2['Total_Safety_Complaints']+1)
testing2['Adverse_Weather_Metric'] = np.log(testing2['Adverse_Weather_Metric']+1)
testing2['Cabin_Temperature'] = np.log(testing2['Cabin_Temperature']+1)
testing2['Turbulence_In_gforces'] = np.log(testing2['Turbulence_In_gforces']+1)

#Fixing left skew
testing2['Control_Metric'] = np.power(testing2['Control_Metric'], 2)
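As an aside, `np.log(x + 1)` is also available as `np.log1p`, which is numerically safer when values are close to zero; a quick equivalence check on toy values:

```python
import numpy as np

# np.log1p(x) computes log(x + 1) with better precision when x is tiny
x = np.array([0.0, 1e-12, 0.5, 9.0])
assert np.allclose(np.log1p(x), np.log(x + 1.0))
```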
In [ ]:
testing2.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric Accident_ID
0 19.497717 16 1.945910 5205.813236 0.328554 4.373490 37949.724386 2 0.067371 1
1 58.173516 15 1.386294 4171.252251 0.223816 4.377014 30194.805567 2 0.002774 10
2 33.287671 15 1.386294 4188.933272 0.290180 4.476882 17572.925484 1 0.004307 14
3 3.287671 21 1.791759 4404.022240 0.351906 4.405010 40209.186341 2 0.182314 17
4 10.867580 18 1.098612 3148.058972 0.272488 4.384773 35495.525408 2 0.394536 21
In [ ]:
ID_Col= testing2[['Accident_ID']]
testing_df= testing2.drop(['Accident_ID'], axis=1)
testing_df.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 19.497717 16 1.945910 5205.813236 0.328554 4.373490 37949.724386 2 0.067371
1 58.173516 15 1.386294 4171.252251 0.223816 4.377014 30194.805567 2 0.002774
2 33.287671 15 1.386294 4188.933272 0.290180 4.476882 17572.925484 1 0.004307
3 3.287671 21 1.791759 4404.022240 0.351906 4.405010 40209.186341 2 0.182314
4 10.867580 18 1.098612 3148.058972 0.272488 4.384773 35495.525408 2 0.394536
In [ ]:
#Standardization
#Note: fitting a fresh scaler on the test set (as below) lets test-set statistics
#drive the scaling; best practice is to reuse the scaler fit on the training
#data via scaler.transform(testing_df) to avoid this leakage.
scaler= preprocessing.StandardScaler()
scaled_df_test= scaler.fit_transform(testing_df)
scaled_df_test= pd.DataFrame(scaled_df_test, columns= ['Safety_Score', 'Days_Since_Inspection','Total_Safety_Complaints', 'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature', 'Max_Elevation', 'Violations', 'Adverse_Weather_Metric'])
scaled_df_test.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 -1.371727 0.866845 0.355919 0.542584 0.153563 -0.614333 0.586995 0.009034 -0.483246
1 1.004384 0.582969 -0.232222 -0.157369 -1.111490 -0.507807 -0.230758 0.009034 -0.741989
2 -0.524519 0.582969 -0.232222 -0.145406 -0.309925 2.511257 -1.561731 -0.972910 -0.735846
3 -2.367618 2.286227 0.193911 0.000116 0.435613 0.338538 0.825254 0.009034 -0.022850
4 -1.901934 1.434598 -0.534568 -0.849630 -0.523613 -0.273256 0.328201 0.009034 0.827197
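One caveat with the cell above: it fits a brand-new `StandardScaler` on the test set, so the test data is scaled by its own statistics rather than the training set's. A minimal leak-free sketch (hypothetical single-column frames, not the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames; in the notebook these would be the train/test features
train = pd.DataFrame({"Safety_Score": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Safety_Score": [15.0, 25.0]})

scaler = StandardScaler().fit(train)   # learn mean/SD from training data only
test_z = scaler.transform(test)        # test is scaled with TRAINING statistics

print(test_z.ravel().round(4))         # [-0.6124  0.6124]
```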

Model Building¶

Feature Selection¶

In [ ]:
# Rearrange train data
train_df= scaled_df[['Safety_Score', 'Days_Since_Inspection','Total_Safety_Complaints', 'Control_Metric', 'Turbulence_In_gforces', 'Cabin_Temperature', 'Max_Elevation', 'Violations', 'Adverse_Weather_Metric']]
train_df.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
0 0.420351 0.301913 1.611550 0.454187 -0.921487 -0.700251 -0.070907 0.949826 0.957591
1 1.194083 -0.828034 1.820414 0.547996 0.492818 1.650795 -0.634031 -0.011743 0.861217
2 1.228768 0.019426 1.290592 0.012680 -0.378571 -0.393082 0.770327 0.949826 -1.549721
3 0.353650 -0.545548 0.727179 0.779369 -0.239031 0.678977 1.141707 -0.973311 0.597230
4 -0.908335 0.019426 1.741727 -1.358884 1.271467 -1.033508 0.371655 -0.011743 0.504031
In [ ]:
# Check if it's same as original scaled dataframe
scaled_df.head()
Out[ ]:
Safety_Score Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Adverse_Weather_Metric Total_Safety_Complaints Days_Since_Inspection Violations Accident_Type_Code
0 0.420351 0.454187 -0.921487 -0.700251 -0.070907 0.957591 1.611550 0.301913 0.949826 -0.953875
1 1.194083 0.547996 0.492818 1.650795 -0.634031 0.861217 1.820414 -0.828034 -0.011743 -0.953875
2 1.228768 0.012680 -0.378571 -0.393082 0.770327 -1.549721 1.290592 0.019426 0.949826 1.673713
3 0.353650 0.779369 -0.239031 0.678977 1.141707 0.597230 0.727179 -0.545548 -0.973311 -0.428357
4 -0.908335 -1.358884 1.271467 -1.033508 0.371655 0.504031 1.741727 0.019426 -0.011743 -0.428357
In [ ]:
#Put into X and y arrays
X= train_df
y= df_train['Severity']
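Since scikit-learn estimators consume positional arrays, the column order of the test frame must match the training frame; a small guard (assumed column names) catches misalignment before prediction:

```python
import pandas as pd

# Toy frames with the same (assumed) columns in different orders
cols = ["Safety_Score", "Days_Since_Inspection", "Violations"]
train_df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=cols)
test_df = pd.DataFrame([[6.0, 4.0, 5.0]],
                       columns=["Violations", "Safety_Score", "Days_Since_Inspection"])

# Reindex the test frame into the training order before calling predict()
test_df = test_df[cols]
assert list(test_df.columns) == list(train_df.columns)
print(test_df.iloc[0].tolist())   # [4.0, 5.0, 6.0]
```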

Train-Test Split¶

In [ ]:
#Split into train and validation sets
X_train, X_Val, y_train, y_Val= train_test_split(X, y, test_size=0.2, random_state=20)
print("shape of training data:", X_train.shape, "\nShape of Validation data:", X_Val.shape, "\nShape of training label:", y_train.shape, "\nShape of Validation label:", y_Val.shape)
shape of training data: (7992, 9) 
Shape of Validation data: (1998, 9) 
Shape of training label: (7992,) 
Shape of Validation label: (1998,)
  • Proportion: The dataset is successfully split into 80% training and 20% validation subsets.
  • Consistency: Shapes of features and labels align correctly between training and validation sets.
  • Next Steps: This split allows the model to be trained on X_train and y_train and evaluated on X_Val and y_Val.
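With four severity classes, a stratified split keeps the class proportions identical in both subsets; a sketch of the `stratify` option (toy imbalanced labels, not the notebook's split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 60 of class 0, 40 of class 1
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y
)

# The 60/40 class ratio is preserved exactly in the 20-row validation set
print(np.bincount(y_va))   # [12  8]
```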
In [ ]:
X_train.head()
Out[ ]:
Safety_Score Days_Since_Inspection Total_Safety_Complaints Control_Metric Turbulence_In_gforces Cabin_Temperature Max_Elevation Violations Adverse_Weather_Metric
113 -0.353382 -0.545548 0.490248 1.957960 -1.921152 -0.535029 -1.049970 -0.011743 0.262105
6341 -0.265336 -0.828034 0.348466 1.398269 -1.602205 0.105733 -1.696070 -0.973311 1.127956
104 -1.273857 0.584400 -0.981700 -0.802370 1.165562 -1.079246 0.496164 -0.973311 0.007698
1698 -0.961696 0.301913 0.615308 1.671317 -0.657895 0.811617 -0.242692 -0.011743 1.443067
2586 0.417683 -0.545548 -0.981700 1.288584 -0.661790 0.822345 -0.599910 -0.973311 0.349731
In [ ]:
y_train.head()
Out[ ]:
Severity
114 4
6350 4
105 2
1704 4
2592 3

Model Training¶

Baseline Models¶

This experiment involves training and evaluating three machine learning models: Random Forest, XGBoost, and a Neural Network.

In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize and train the Random Forest Classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)  # You can adjust hyperparameters
rf_classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = rf_classifier.predict(X_Val)

# Evaluate the model
accuracy = accuracy_score(y_Val, y_pred)
print(f"Random Forest Accuracy: {accuracy}")

#Now you can use the trained model to predict on the test set
#test_predictions = rf_classifier.predict(scaled_df_test)
Random Forest Accuracy: 0.9434434434434434

Random Forest Classifier:

Description:

  • A tree-based ensemble method that combines multiple decision trees to improve performance.
  • n_estimators=100: Uses 100 decision trees.

Accuracy:

  • The validation accuracy was about 0.94.

Strengths:

  • Handles both categorical and numerical features well.
  • Robust to overfitting with enough trees.

Inference:

  • Achieved good accuracy on the validation set, making it a strong baseline.
  • Handles multi-class classification natively, though gradient-boosted methods such as XGBoost often edge it out on structured data.
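Overall accuracy can mask per-class mistakes in a four-class problem; a confusion-matrix sketch on toy severity codes (hypothetical labels, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true vs. predicted severity codes (classes 1-4)
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = np.array([1, 1, 2, 3, 3, 3, 4, 2])

# Rows = true class, columns = predicted class (labels in sorted order 1..4);
# off-diagonal counts reveal which classes get confused with each other
cm = confusion_matrix(y_true, y_pred)
print(cm)
```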
In [ ]:
from sklearn.preprocessing import LabelEncoder

# Encode the labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)  # Convert strings to integers
y_Val_encoded = label_encoder.transform(y_Val)          # Use the same encoding

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Initialize and train the XGBoost Classifier
xgb_classifier = xgb.XGBClassifier(objective='multi:softmax', num_class=4, random_state=42)
xgb_classifier.fit(X_train, y_train_encoded)

# Make predictions on the validation set
y_pred_xgb = xgb_classifier.predict(X_Val)

# Decode predictions back to original labels if necessary
y_pred_xgb_decoded = label_encoder.inverse_transform(y_pred_xgb)
y_Val_decoded = label_encoder.inverse_transform(y_Val_encoded)

# Evaluate the model
accuracy_xgb = accuracy_score(y_Val_decoded, y_pred_xgb_decoded)
print(f"XGBoost Accuracy: {accuracy_xgb}")
XGBoost Accuracy: 0.953953953953954

XGBoost Classifier:

Description:

  • A gradient-boosting algorithm optimized for speed and performance.
  • objective='multi:softmax': Used for multi-class classification.
  • num_class=4: Specifies four target classes.

Label Encoding:

  • Categorical labels were encoded into integers using LabelEncoder.
  • Predictions were decoded back to original labels for evaluation.

Accuracy:

  • Validation accuracy was 0.95.

Strengths:

  • Often achieves superior performance on structured data.
  • Built-in handling of multi-class tasks.

Inference:

  • Achieved slightly higher validation accuracy than Random Forest (0.954 vs. 0.943), consistent with boosting's typical edge on structured data.
  • Well-suited for this problem if computation time is not a concern.
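`LabelEncoder` assigns integer codes in sorted order of the class values; a round-trip sketch (hypothetical severity strings, since the actual Severity values are encoded upstream):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical severity labels; the real Severity column is encoded the same way
labels = ["Minor", "Fatal", "Minor", "Major"]
le = LabelEncoder()
codes = le.fit_transform(labels)

print(list(le.classes_))   # ['Fatal', 'Major', 'Minor'] (sorted order)
print(list(codes))         # [2, 0, 2, 1]

# inverse_transform recovers the original strings exactly
assert list(le.inverse_transform(codes)) == labels
```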

Neural Network¶

In [ ]:
import tensorflow as tf
from tensorflow.keras import layers, regularizers
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_Val_scaled = scaler.transform(X_Val)
# Simplified and Optimized Neural Network Architecture
model = tf.keras.Sequential([
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(4, activation='softmax')  # Output layer for multi-class classification
])

# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
history = model.fit(X_train_scaled, pd.get_dummies(y_train).values,
                    epochs=30,
                    batch_size=16,
                    validation_data=(X_Val_scaled, pd.get_dummies(y_Val).values))

# Evaluate the model
loss, accuracy = model.evaluate(X_Val_scaled, pd.get_dummies(y_Val).values)
print(f"Neural Network Accuracy: {accuracy}")
Epoch 1/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 4s 6ms/step - accuracy: 0.3004 - loss: 1.3861 - val_accuracy: 0.4530 - val_loss: 1.2413
Epoch 2/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.5120 - loss: 1.1550 - val_accuracy: 0.6657 - val_loss: 0.9281
Epoch 3/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.6863 - loss: 0.8974 - val_accuracy: 0.7302 - val_loss: 0.7756
Epoch 4/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7548 - loss: 0.7631 - val_accuracy: 0.7748 - val_loss: 0.6654
Epoch 5/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8076 - loss: 0.6305 - val_accuracy: 0.8093 - val_loss: 0.5764
Epoch 6/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8244 - loss: 0.5913 - val_accuracy: 0.8433 - val_loss: 0.5060
Epoch 7/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8607 - loss: 0.5113 - val_accuracy: 0.8809 - val_loss: 0.4325
Epoch 8/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.8937 - loss: 0.4155 - val_accuracy: 0.9059 - val_loss: 0.3730
Epoch 9/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9142 - loss: 0.3724 - val_accuracy: 0.9134 - val_loss: 0.3294
Epoch 10/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9235 - loss: 0.3198 - val_accuracy: 0.9159 - val_loss: 0.3059
Epoch 11/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9311 - loss: 0.3059 - val_accuracy: 0.9184 - val_loss: 0.2949
Epoch 12/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9251 - loss: 0.3089 - val_accuracy: 0.9184 - val_loss: 0.2837
Epoch 13/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9309 - loss: 0.2885 - val_accuracy: 0.9239 - val_loss: 0.2715
Epoch 14/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9302 - loss: 0.2710 - val_accuracy: 0.9219 - val_loss: 0.2680
Epoch 15/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9262 - loss: 0.2770 - val_accuracy: 0.9284 - val_loss: 0.2612
Epoch 16/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9318 - loss: 0.2631 - val_accuracy: 0.9284 - val_loss: 0.2531
Epoch 17/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9360 - loss: 0.2615 - val_accuracy: 0.9264 - val_loss: 0.2456
Epoch 18/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9346 - loss: 0.2492 - val_accuracy: 0.9209 - val_loss: 0.2463
Epoch 19/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9424 - loss: 0.2297 - val_accuracy: 0.9264 - val_loss: 0.2415
Epoch 20/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9416 - loss: 0.2216 - val_accuracy: 0.9264 - val_loss: 0.2379
Epoch 21/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9384 - loss: 0.2462 - val_accuracy: 0.9239 - val_loss: 0.2357
Epoch 22/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9325 - loss: 0.2247 - val_accuracy: 0.9314 - val_loss: 0.2287
Epoch 23/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9372 - loss: 0.2434 - val_accuracy: 0.9334 - val_loss: 0.2284
Epoch 24/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9373 - loss: 0.2410 - val_accuracy: 0.9299 - val_loss: 0.2280
Epoch 25/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9336 - loss: 0.2500 - val_accuracy: 0.9339 - val_loss: 0.2227
Epoch 26/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9378 - loss: 0.2238 - val_accuracy: 0.9309 - val_loss: 0.2220
Epoch 27/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9413 - loss: 0.2134 - val_accuracy: 0.9294 - val_loss: 0.2211
Epoch 28/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9334 - loss: 0.2341 - val_accuracy: 0.9304 - val_loss: 0.2204
Epoch 29/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9346 - loss: 0.2240 - val_accuracy: 0.9309 - val_loss: 0.2224
Epoch 30/30
500/500 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9374 - loss: 0.2174 - val_accuracy: 0.9304 - val_loss: 0.2169
63/63 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9366 - loss: 0.2033
Neural Network Accuracy: 0.9304304122924805

Neural Network:

Description:

  • A feedforward neural network with: Input Layer → 16 neurons → 8 neurons → 4 neurons (output layer for 4 classes with softmax activation).
  • Optimized with the Adam optimizer and categorical_crossentropy for multi-class classification.

Data Scaling:

  • Features were scaled using StandardScaler to ensure the neural network performs optimally.

Training:

  • Used 30 epochs and a batch size of 16.
  • Outputs validation accuracy during training.

Accuracy:

  • Final accuracy was 0.93.

Strengths:

  • Can capture non-linear relationships.
  • Flexible architecture allows customization.

Inference:

  • Neural networks may take longer to train and are prone to overfitting on small datasets.
  • Achieved competitive accuracy but may not outperform XGBoost for this structured data.
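The model one-hot encodes the labels with `pd.get_dummies` to pair with `categorical_crossentropy`; Keras also offers `sparse_categorical_crossentropy`, which accepts the integer labels directly. That the two losses agree can be checked in plain NumPy:

```python
import numpy as np

# Toy softmax outputs for 2 samples over 3 classes, plus integer labels
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
one_hot = np.eye(3)[labels]

# categorical_crossentropy over one-hot targets
cce = -np.mean(np.sum(one_hot * np.log(probs), axis=1))
# sparse_categorical_crossentropy over integer targets
scce = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

assert np.isclose(cce, scce)   # identical losses, no one-hot step needed
```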

Conclusion¶

The project "Predicting Aviation Accident Severity: A Data-Driven Approach to Enhancing Air Travel Safety" applies data science and machine learning to a critical safety problem. The points below summarize the notebook's findings and their broader relevance:

Data Science and Lifecycle

  • Data Cleaning: The dataset underwent extensive preprocessing, including handling missing values, normalizing columns, and correcting categorical variables, which is critical for reliable model outcomes.

  • Exploratory Data Analysis (EDA): Insights derived from EDA, such as distributions, outliers, and correlations, reveal relationships between features, aiding in feature selection.

  • Feature Engineering: Transformations to address skewness and outlier treatment highlight the importance of preparing data for predictive modeling.

Machine Learning

  • Model Building: Various models were built, including baseline classifiers and a neural network, showcasing the comparative advantage of advanced algorithms over traditional methods.

  • Evaluation Metrics: Validation accuracy and loss were used to compare models; XGBoost achieved the highest accuracy (about 0.95), with Random Forest (0.94) and the neural network (0.93) close behind.

Neural Network

  • Architecture: The use of deep learning indicates a focus on capturing complex patterns within the dataset, particularly for predicting accident severity.

  • Performance: The network reached competitive accuracy (about 0.93), though the gradient-boosted baseline remained slightly ahead on this structured dataset.

Project Relevance in Today's World

  • Aviation Safety: Given the potential catastrophic impacts of aviation accidents, this project directly contributes to improving safety protocols by identifying high-risk scenarios.

  • Data-Driven Decision Making: The integration of machine learning into safety assessments aligns with modern trends of leveraging big data for critical decision-making in high-stakes industries.

  • Broader Implications: The methodologies applied in this project could extend to other domains such as healthcare, transportation, and industrial safety, emphasizing its scalability and interdisciplinary impact.

Final Thoughts

This project effectively demonstrates how data science can address real-world challenges. It integrates the entire data science lifecycle with modern machine learning, creating a robust framework to enhance aviation safety. By focusing on practical applications, it underscores the transformative potential of technology in ensuring safety and efficiency in critical industries.